---
title: Apache DataFu Pig - Getting Started
version: 1.6.1
section_name: Getting Started
license: >
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
---

# DataFu Pig

Apache DataFu Pig is a collection of user-defined functions for working with large scale data in [Apache Pig](http://pig.apache.org/).  It has a number of useful functions available:

<div class="row">
  <div class="col-lg-6">
    <h4>Statistics</h4>
    <p>
      Compute quantiles, median, variance, wilson binary confidence, etc.
    </p>

    <h4>Set Operations</h4>
    <p>
      Perform set intersection, union, or difference of bags.
    </p>

    <h4>Bags</h4>
    <p>
      Convenient functions for working with bags such as enumerate items,
      append, prepend, concat, group, distinct, etc.
    </p>

    <h4>Sessions</h4>
    <p>
      Sessionize events from a stream of data.
    </p>
  </div>

  <div class="col-lg-6">
    <h4>Estimation</h4>
    <p>
      Streaming implementations that can estimate
      quantiles and median.
    </p>

    <h4>Sampling</h4>
    <p>
      Simple random sampling with or without replacement,
      weighted sampling.
    </p>

    <h4>Link Analysis</h4>
    <p>
      Run PageRank on a graph represented by a bag of
      nodes and edges.
    </p>

    <h4>More</h4>
    <p>
      Other useful methods like Assert and Coalesce.
    </p>
  </div>
</div>

If you'd like to read more details about these functions, check out the [Guide](/docs/datafu/guide.html).  Otherwise if you are
ready to get started using DataFu Pig, keep reading.

The rest of this page assumes you already have a built JAR available.  If this is not the case, please see the [Download](/docs/download.html) page.

## Basic Example: Computing Median

Let's use DataFu Pig to perform a very basic task: computing the median of some data.
Suppose we have a file `input` in Hadoop with the following content:

    1
    2
    3
    2
    2
    2
    3
    2
    2
    1

We can clearly see that the median is 2 for this data set.  First we'll start up Pig's grunt shell by running `pig` and
then register the DataFu JAR:

```pig
register datafu-pig-<%= current_page.data.version %>.jar
```

To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit
of not requiring the data to be sorted:

```pig
DEFINE Median datafu.pig.stats.StreamingMedian();
```

Next we can load the data and pass it into the function to compute the median:

```pig
data = LOAD 'input' using PigStorage() as (val:int);
data = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP data
```

This produces the expected output:

    ((2.0))

## Next Steps

Check out the [Guide](/docs/datafu/guide.html) for more information on what you can do with DataFu Pig.